Skip to content

Conversation

benjeffery
Copy link
Contributor

Fixes #397

@benjeffery benjeffery force-pushed the remove_bed_reader branch from a568103 to 36def1b Compare May 22, 2025 13:58
@coveralls
Copy link
Collaborator

coveralls commented May 22, 2025

Coverage Status

coverage: 98.212% (+0.03%) from 98.182%
when pulling 08ceb74 on benjeffery:remove_bed_reader
into b87be4c on sgkit-dev:main.

@benjeffery
Copy link
Contributor Author

Plink tests run somewhat faster with this change! 10% or so

@jeromekelleher
Copy link
Contributor

Haha, well, there you go! We'll need to modularise and test more, but I think that'll be quicker than dealing with all the details about packaging.

One quick question: is there a reason you didn't use pandas for the text files? I think this would be worthwhile and it would mean that we can port over the sgkit conversion functions like read_fam I'm happy with pandas as a dependency, it's pretty much universal now.

@jeromekelleher
Copy link
Contributor

@tomwhite - do you see any issues with dropping bed_reader and using the lookup table approach here for decoding bed? I see a lot of advantages...

@tomwhite
Copy link
Contributor

@tomwhite - do you see any issues with dropping bed_reader and using the lookup table approach here for decoding bed? I see a lot of advantages...

I'm pleased to see this. It's not like bed_reader supports PLINK 2, so doing this wouldn't cut off a future migration path. We probably want more tests for corner cases that bed_reader supports.

Can you do the same thing for BGEN? 😄

@benjeffery
Copy link
Contributor Author

is there a reason you didn't use pandas for the text files

Didn't want to add a dependency while removing one! If you're happy with pandas then happy to switch to it.

@jeromekelleher
Copy link
Contributor

Can you do the same thing for BGEN?

No - it's too complicated for this kind of treatment unfortunately

@jeromekelleher
Copy link
Contributor

I'm happy to commit an initial version of this that just removes the bed_reader dep @benjeffery, and we can log issues for porting in the sgkit auxiliary file reading code and adding more tests for the data reading.

@benjeffery
Copy link
Contributor Author

While writing this, I noticed that we're not storing the plink "family ID" (or parent ids) anywhere. Should those be included in the zarr?

@jeromekelleher
Copy link
Contributor

Leaving them out for now, there's an issue open to track

@jeromekelleher
Copy link
Contributor

I'll follow up on this here @benjeffery, I'm going to tack on some commits to move in the sgkit parsers and tests and will modularise the reader.

@jeromekelleher jeromekelleher marked this pull request as ready for review May 22, 2025 21:26
@jeromekelleher
Copy link
Contributor

I've added some tests that generates a bunch of BED files using bed_reader, and I think it's looking solid. I haven't looked at performance at all - will need to source a big plink tomorrow and try it out.

Copy link
Contributor

@tomwhite tomwhite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

Copy link
Contributor Author

@benjeffery benjeffery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, thanks for tidying up my rushed prototype!

if magic != b"\x6c\x1b\x01":
raise ValueError("Invalid BED file magic bytes")

# We could check the size of the bed file here, but that would
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point about streams - I guess there is no way to know if the user has inconsistent bim/fam/bed files, some combinations of which would just give silently corrupted data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's just how it is in the real world. There's no point in ruling out useful functionality just to do checking that other people don't bother with anyway

@jeromekelleher jeromekelleher added this pull request to the merge queue May 23, 2025
Merged via the queue into sgkit-dev:main with commit 6eb88ed May 23, 2025
15 checks passed
@benjeffery benjeffery deleted the remove_bed_reader branch May 23, 2025 10:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Investigate decoding BED in numpy
4 participants